Abstract

A stroke occurs when the blood supply to the brain is insufficient, which causes brain cells to die. It is
currently one of the world's leading causes of death. Examination of affected individuals has identified a
number of risk variables that are thought to be connected to the cause of stroke. Numerous studies have used
these risk variables to predict and categorize stroke disorders, and most of the resulting models are built
with machine learning and data mining techniques. In this work, we employ four machine learning algorithms
to identify the type of stroke that may have occurred, based on medical report data and an individual's
physical condition. We gathered a sizable number of hospital entries. The study employs decision trees,
Naive Bayes, an artificial neural network (ANN), and Random Forest. The aim is to evaluate these algorithms
and determine which one performs the task most accurately. Our evaluation shows that the Random Forest
method achieves the highest accuracy of all the algorithms, at 99%.
1. Introduction

A stroke occurs when poor blood supply to the brain
causes cell death. Hemorrhagic stroke and ischemic
stroke are the two primary forms of stroke.
Hemorrhagic stroke is caused by bleeding, and
ischemic stroke is caused by a reduction in blood
flow. Transient ischemic attack is another form of
stroke. An embolic stroke happens when a clot forms
elsewhere in the body, travels to the brain, and
obstructs blood flow there. A thrombotic stroke is
brought on by a clot that impairs arterial blood
flow. Another name for transient ischemic attack is
"mini stroke". Many individuals lose their lives.

Polynomial, quadratic, radial basis function, and
linear functions were applied. The highest accuracy
of 91% was found with the linear kernel, which gives
a balanced F1-score (F-measure) of 91.7 [1]. Singh
and Choudhary developed a model with an Artificial
Neural Network (ANN) for stroke prediction. They
collected datasets from the Cardiovascular Health
Study (CHS) database. The C4.5 decision tree
algorithm was used for feature selection and
Principal Component Analysis (PCA) for dimension
reduction. The ANN was trained with the
backpropagation learning method. They obtained
accuracies of 95%, 95.2%, and 97.7% for the three
datasets, respectively [2, 3]. Adam et al. used
k-nearest neighbor (KNN). Their data set was
collected from several hospitals and medical centers
in Sudan and is the first data set for ischemic
disease in Sudan. It contains 15 features and
information about 400 patients. The results of the
experiment show that decision tree classification
performs better than the KNN algorithm. Their data
set contains 1000 records. The PCA algorithm was
used for dimensional reduction. In ten rounds of
each algorithm, they obtained highest accuracies of
92%, 91%, and 94% with the Neural Network, Naive
Bayes classifier, and Decision Tree algorithm,
respectively.

Some of the methods use a very small data set.
Govindarajan et al. predicted only two classes of
stroke. Therefore, we propose a method which uses a
large data set with four classes of stroke [4-6].

1.1. Scope of the Project 
This project predicts which individuals are likely
to develop cardiovascular disease by extracting
medical-history attributes, such as blood pressure,
blood sugar level, and chest pain, from a dataset of
patient records.

1.2. Objective 
The primary goal is to develop a predictive model
that estimates the likelihood of a heart stroke
occurring in an individual based on selected
features or risk factors. The model is intended to
assist in early detection and risk assessment.
2. Existing System 
Existing systems typically rely on traditional risk
assessment methods or manual evaluation of risk
factors related to heart strokes. They may lack the
efficiency and accuracy provided by modern machine
learning techniques [7, 8].

2.1. Existing System Disadvantages 
The primary disadvantages of existing heart disease
prediction systems lie in modeling the properties of
the input dataset, computing attribute risk factors,
and achieving high prediction accuracy.
3. Proposed System 

• Upload Stroke Data set

• Train Naive Bayes Algorithm

• Train J48 Algorithm

• Train KNN Algorithm

• Train Random Forest Algorithm
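
The steps listed above (load the data, then train each classifier) can be sketched as a small evaluation loop. This is a minimal illustration, not the system used in this study: the fit/predict interface, the MajorityClass placeholder model, the 70/30 split, and the toy data are all assumptions introduced here. Any of the four algorithms could be plugged in under the same interface.

```python
import random
from collections import Counter


class MajorityClass:
    """Hypothetical placeholder model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def train_test_split(X, y, test_ratio=0.3, seed=0):
    """Shuffle the row indices and hold out a fraction of the rows for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def compare(models, X, y):
    """Train every model on the same split and report held-out accuracy."""
    X_tr, y_tr, X_te, y_te = train_test_split(X, y)
    return {name: accuracy(y_te, model.fit(X_tr, y_tr).predict(X_te))
            for name, model in models.items()}


# Invented toy data: [age, average glucose level] -> stroke label
X = [[67, 228.7], [61, 202.2], [49, 171.2], [30, 85.0], [25, 90.1],
     [41, 99.5], [80, 105.9], [35, 78.3], [52, 110.0], [22, 82.4]]
y = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

scores = compare({"baseline": MajorityClass()}, X, y)
```

Scoring every model on the same held-out split keeps the accuracy figures comparable across algorithms.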
4. Related Work 
Naive Bayes classifiers are a collection of
classification algorithms based on Bayes' theorem.
Bayes' theorem gives the conditional probability of
an event occurring given that another event has
already occurred. The J48 algorithm is a Java
implementation of the C4.5 decision tree algorithm,
commonly used in machine learning for classification
tasks. As an implementation of C4.5, J48 constructs
decision trees that provide interpretable models
suitable for various domains while requiring less
computation than certain other algorithms [9, 10].
k-Nearest Neighbors (k-NN) is a non-parametric
method used for classification and regression.
Predictions are made for a new instance by
searching through the entire training set for the k
most similar instances, called the neighbors. A
majority vote is usually used to choose the class.
Different distance metrics can be used with k-NN,
such as Euclidean distance, Manhattan (city block)
distance, and Minkowski distance. The Random Forest
ensemble learning method is used for classification
and regression. Each classifier in the ensemble is a
decision tree classifier (e.g., ID3, C4.5, or CART),
so that the collection of classifiers is a forest.
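
The k-NN procedure described above (rank the training instances by distance, then take a majority vote) can be sketched in a few lines. This is an illustrative sketch only; the function names and the toy points are assumptions introduced here. The Minkowski distance with p = 1 reduces to the Manhattan distance and with p = 2 to the Euclidean distance.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan (city block), p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, k=3, p=2):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy points, already scaled to [0, 1]
train_X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
           [0.9, 0.8], [0.8, 0.9], [0.85, 0.95]]
train_y = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_X, train_y, [0.2, 0.2], k=3)       # Euclidean
near_corner = knn_predict(train_X, train_y, [0.9, 0.9], k=3, p=1)  # Manhattan
```

Because every prediction scans the full training set, features on very different scales (for example, age and glucose level) are usually rescaled first so that no single feature dominates the distance.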

5. Methodology 


[Spreadsheet excerpt of the first rows of the dataset, with the columns id, gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, and stroke]


Figure 1 Dataset 
5.1. Module Description 
This part is divided into two sections: a
description of the data and the machine learning
classifiers. Both are outlined below.
5.1.1. Description of the Data 
The cardiac stroke dataset from the Kaggle website
was used for this study, as shown in Figure 1. The
dataset has a total of 12 attributes. Below is an
overview of the features that are employed in the
proposed work:

• ID: This attribute identifies a person. The
data is numeric.

• Age: This attribute indicates a person's age.
The data is numeric.

• Gender: This attribute indicates an
individual's gender. The data is categorical.

• Hypertension: This attribute indicates whether
or not the person has hypertension. The data is
numeric.

• Work type: This attribute reflects the setting
in which an individual works [11].
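
To make the attribute types above concrete, the following sketch parses a few sample rows and prepares them for a classifier. It is a hedged illustration, not the preprocessing used in this study: the five rows are transcribed from the Figure 1 excerpt, the column names and category labels are written out in full where the figure truncates them (an assumption), and dropping the ID column and filling missing BMI values with the median are choices made here for illustration.

```python
import csv
import io
from statistics import median

# Sample rows transcribed from the Figure 1 excerpt (labels spelled out in full)
RAW = """id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
9046,Male,67,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
51676,Female,61,0,0,Yes,Self-employed,Rural,202.21,N/A,never smoked,1
60182,Female,49,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
12095,Female,61,0,1,Yes,Govt_job,Rural,120.46,36.8,smokes,1
25226,Male,57,0,1,No,Govt_job,Urban,217.08,N/A,Unknown,1
"""

NUMERIC = ("age", "hypertension", "heart_disease", "avg_glucose_level", "stroke")


def load_rows(text):
    """Parse CSV text, convert numeric columns, and mark missing BMI as None."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        row = dict(rec)
        for col in NUMERIC:
            row[col] = float(rec[col])
        row["bmi"] = None if rec["bmi"] == "N/A" else float(rec["bmi"])
        del row["id"]  # assumption: the identifier is not used as a predictor
        rows.append(row)
    return rows


def impute_bmi(rows):
    """Replace missing BMI values with the median of the known ones (assumed strategy)."""
    fill = median(r["bmi"] for r in rows if r["bmi"] is not None)
    for r in rows:
        if r["bmi"] is None:
            r["bmi"] = fill
    return rows


rows = impute_bmi(load_rows(RAW))
```

Categorical attributes such as gender and work type are left as strings here; most classifiers would need them encoded as numbers in a further step.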

5.1.2. Machine Learning Classifiers 

• Random Forest: Random Forest techniques are
used for both classification and regression.
Predictions are built upon a tree-like
organization of the data. When used on large
datasets, the Random Forest algorithm can yield
comparable results even when a significant
portion of the record values are missing. The
samples produced by the decision tree can be
saved and applied to various data sets. There are
two steps in Random Forest: first, create the
random forest; next, use the classifier created
in the previous stage to make a prediction.

• Decision Tree: In the Decision Tree algorithm,
the internal nodes represent the attributes of
the dataset, while the outer branches produce the
outcome. Decision trees are employed because they
are efficient, trustworthy, easy to comprehend,
and require very little.

• KNN: The supervised machine learning (ML)
method known as k-nearest neighbors, or KNN, can
be used for both regression and classification
problems. However, it is mostly used in industry
to solve classification and forecasting issues.
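
For the Decision Tree item above, the choice of which attribute a node should test can be illustrated with the gain ratio, the split criterion used by C4.5-style trees. The sketch below handles categorical attributes only, and the function names and toy data are assumptions introduced for illustration.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on a categorical attribute,
    divided by the split information of that attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels that remains after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Invented toy data: the attribute separates the two classes perfectly
smoking = ["smokes", "smokes", "never smoked", "never smoked"]
stroke = [1, 1, 0, 0]
perfect = gain_ratio(smoking, stroke)

# Invented toy data: the attribute carries no information about the class
uninformative = gain_ratio(["a", "b", "a", "b"], stroke)
```

The attribute with the highest gain ratio is placed at the node, and the procedure repeats on each resulting subset of the data.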

6. Techniques and Algorithms Used

Naive Bayes Algorithm: Bayes' theorem gives the
conditional probability of an event occurring given
that another event has already occurred.
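
Written out, the theorem and the resulting Naive Bayes decision rule take the following form. The symbols are introduced here for illustration: C denotes a class (for example, a stroke type) and x_1, ..., x_n denote the feature values of one patient. The second line relies on the "naive" assumption that the features are conditionally independent given the class.

```latex
\begin{align}
P(C \mid x_1, \dots, x_n) &= \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)} \\
\hat{C} &= \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
\end{align}
```

The denominator is the same for every class, so it can be dropped when only the most probable class is needed.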


Random Forest Algorithm: The Random Forest
ensemble learning method is used for classification
and regression. Each classifier in the ensemble is a
decision tree classifier, so that the collection of
classifiers is a forest. Several studies have used
decision trees to predict life-threatening diseases
and have found them to be efficient.
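
The ensemble idea can be sketched as follows. This is a deliberately simplified stand-in for a full Random Forest, under stated assumptions: each "tree" is a one-split stump rather than a full decision tree, each stump is trained on a bootstrap sample using one randomly chosen feature, and the class is decided by majority vote. All names and the toy data are introduced here for illustration.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """One-split tree on a single feature: keep the threshold with the fewest training errors."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= t]
        right = [lab for row, lab in zip(X, y) if row[feature] > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
        errors = (sum(lab != l_lab for lab in left)
                  + sum(lab != r_lab for lab in right))
        if best is None or errors < best[0]:
            best = (errors, feature, t, l_lab, r_lab)
    return best[1:]  # (feature, threshold, left label, right label)


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample with a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # sample rows with replacement
        feature = rng.randrange(len(X[0]))         # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Majority vote over the predictions of all trees."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


# Invented toy data: [age, average glucose level] -> label
X = [[30, 85], [25, 90], [35, 78], [22, 82],
     [67, 228], [61, 202], [80, 190], [75, 221]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
```

Because each tree sees a different sample and feature, individual mistakes tend to be outvoted, which is the usual argument for preferring the ensemble over a single tree.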

KNN Algorithm: k-Nearest Neighbors (k-NN) is a
non-parametric method used for classification and
regression. Predictions are made for a new instance
by searching through the entire training set for the
k most similar instances, called the neighbors.

J48 Algorithm: The J48 algorithm is a Java
implementation of the C4.5 decision tree algorithm,
commonly used in machine learning for
classification tasks. Regularization techniques and
validation methods are often used to improve its
generalization capabilities [12]. The overall
system is shown in Figure 2.

7. System Architecture 


Figure 2 System Architecture 


Conclusions 

Given the rise in heart stroke-related fatalities,
it is essential to create a system that can
anticipate heart attacks precisely and effectively.
This study uses a dataset from different hospitals
to examine the accuracy scores of the Random Forest,
Decision Tree, and KNN algorithms for predicting
heart attacks. The outcome of this study shows that
the Random Forest algorithm, which has an accuracy
score of 99.17% for heart attack prediction, is the
most effective algorithm. The study can be improved
in the future by creating a web application based on
the Random Forest method and by using a larger
dataset than the one used in this analysis, which
would help to deliver better results and aid
healthcare professionals in accurately and
efficiently forecasting cardiac disease.